# PART 1: Read CSV, merge, clean and plot outliers.
library(readr)
library(readxl)
library(dplyr)
library(countrycode)
library(car)
source('Read_Clean.R')
cleaned <- Read_Clean()
The data set contains 17 variables. To present them compactly, two dimension-reduction techniques were applied: Multidimensional Scaling (MDS) and Principal Component Analysis (PCA).
To give a general insight into the data, all countries were plotted in three-dimensional space. At first glance, the continents form clusters: countries on the same continent generally have a similar profile. Asia is the most diverse continent and has many outliers; its countries span the range from the European cluster at one end to the African cluster at the other. North America and South America are similar to each other.
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.5.2
source('MDS.R')
## Warning in system(my_command): 'rm' not found
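The contents of `MDS.R` are not shown here, so the following is only a minimal sketch of how a three-dimensional MDS configuration could be obtained. It assumes classical (metric) MDS on Euclidean distances between standardized variables, and it uses the built-in `swiss` data set as a stand-in for the numeric columns of `cleaned`.

```r
# Stand-in data (assumption): built-in `swiss`, 47 rows and 6 numeric indicators.
X <- scale(swiss)                      # standardize so no variable dominates the distances
d <- dist(X)                           # Euclidean distances between observations

# Classical MDS with three output dimensions
mds <- cmdscale(d, k = 3, eig = TRUE)

coords <- mds$points                   # one row per observation, three coordinates
colnames(coords) <- c("Dim1", "Dim2", "Dim3")

gof <- mds$GOF                         # goodness-of-fit measures of the 3-D solution
head(coords)
```

The three columns of `coords` could then be passed to a 3-D plotting function such as `scatterplot3d()`, with the point color mapped to the continent.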
Principal Component Analysis (PCA) was used to provide insight into the data and to visualize it in two-dimensional plots. The plots below use the first three principal components, which together explain 74% of the variance.
PC1, PC2 and PC3 can be interpreted as follows:
PC1: loads highly on the number of phones, life expectancy, corruption index, access to the Internet and income.
PC2: loads highly on the number of suicides and sex ratio.
PC3: is especially meaningful in the context of inequality.
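For readers who want to reproduce this kind of summary, here is a hedged sketch of how the cumulative proportion of variance and the loadings of the first three components can be read from `prcomp()`. The `swiss` data set is a stand-in; the report's own `PCA()` function may be implemented differently.

```r
# Stand-in data (assumption): built-in `swiss` instead of the report's `cleaned` data.
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Loadings (eigenvectors) of the first three components
loadings3 <- pca$rotation[, 1:3]

round(cum_var[1:3], 3)   # cumulative variance explained by PC1..PC3
round(loadings3, 2)      # variables with large absolute values drive a component
```

The sign of each component is arbitrary, so a loading pattern is interpreted by which variables move together, not by whether they are positive or negative.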
Two graphs present the PCA result. Ellipses were added to the graphs to show where the points are concentrated. Their size is influenced by outliers.
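The effect of outliers on ellipse size can be illustrated with a small sketch. It assumes the ellipses are covariance-based (as in a bivariate normal concentration ellipse), which may not match the plotting code used for the report; the points are simulated and purely illustrative.

```r
set.seed(1)

# Simulated scores on two components (illustrative only)
pts <- cbind(PC1 = rnorm(50), PC2 = rnorm(50))
pts_out <- rbind(pts, c(8, 8))          # the same points plus one artificial outlier

# Area of a covariance-based ellipse at a given coverage level:
# pi * chi-square quantile (2 df) * sqrt(det(covariance matrix))
ellipse_area <- function(p, level = 0.95) {
  pi * qchisq(level, df = 2) * sqrt(det(cov(p)))
}

area_clean <- ellipse_area(pts)
area_out <- ellipse_area(pts_out)
area_out / area_clean                   # ratio > 1: one outlier inflates the ellipse
```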
Plot PC1 vs PC2: On the right of the plot, at high values of PC1, are the highly developed countries of Europe and North America. Those continents are above average in low corruption, life expectancy, Internet access, number of phones and income.
On the left side of the plot, at low values of PC1, are the less developed countries of Africa. Those countries are above average in child mortality, number of children per woman and inequality.
Asia presents an interesting phenomenon. It is the most diverse continent along both PC1 and PC2. Some countries in Asia are highly developed while others are rather poor (PC1). On PC2, some countries have extreme values for sex ratio (men significantly outnumber women). Those countries are Qatar and the UAE.
PC1 and PC2 give little insight into Central America or South America, since the values for these continents lie in the middle of the plot.
PrinCompPlot[1]
## [[1]]
Plot PC2 vs PC3: The second plot shows that very high inequality occurs especially in South America and Africa.
The plots above also show a high correlation among the number of phones, low corruption, Internet access and income. Child mortality and the number of children per woman form another group of highly correlated variables.
PrinCompPlot[2]
## [[1]]
PCA on the World Map: To show which countries rank highest on each principal component, the results were plotted on a world map. For each component (PC1, PC2 and PC3), the 15 countries with the highest scores were chosen and plotted on the map.
PrinCompPlot <- PCA(cleaned)
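The selection of the top 15 observations per component can be sketched as follows. This is an assumption about the procedure rather than the code inside `PCA()`, and `swiss` (with provinces as row names) stands in for the country data.

```r
# Stand-in data (assumption): built-in `swiss`; row names play the role of country names.
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]                  # scores of each observation on PC1..PC3

top_n <- 15
top_by_pc <- lapply(colnames(scores), function(pc) {
  # sort the named score vector and keep the names of the highest values
  names(sort(scores[, pc], decreasing = TRUE))[seq_len(top_n)]
})
names(top_by_pc) <- colnames(scores)

top_by_pc$PC1
```

Because the sign of a principal component is arbitrary, "highest" only has a substantive meaning once the direction of each component has been fixed by interpreting its loadings.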
Note: the following columns are excluded from the analysis: total population, number of murders, number of armed forces, total urban population and percentage of investments. These variables correlate weakly with the rest of the columns, so many more dimensions would be needed to explain the data, and such information could not be conveyed in a two-dimensional plot.
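One way to screen for such weakly correlated variables is to look at each variable's largest absolute correlation with any other variable. The sketch below uses `swiss` as a stand-in, and the 0.3 cut-off is an arbitrary assumption, not a value taken from the analysis.

```r
# Stand-in data (assumption): built-in `swiss`.
R <- cor(swiss)
diag(R) <- NA                           # ignore each variable's correlation with itself

# Strongest absolute correlation of each variable with any other variable
max_abs_cor <- apply(abs(R), 1, max, na.rm = TRUE)

threshold <- 0.3                        # hypothetical cut-off
weak <- names(max_abs_cor)[max_abs_cor < threshold]

round(sort(max_abs_cor), 2)
weak                                    # candidates for exclusion, possibly empty
```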
# PART 3: Hierarchical Clustering between Continents
library(ape)
source('cluster_continents.R')
Cl_continents <- cluster_continents(cleaned)
All variables are included.
South America, North America and Europe are very similar, as are Central America, Asia, Oceania and Africa. Interestingly, Africa is clustered with Oceania, which includes Australia and New Zealand but also many small islands that pull Oceania toward the level of Africa.
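A hedged sketch of this kind of clustering is given below: each region is reduced to its mean profile, the profiles are standardized, and the regions are clustered hierarchically. The data are simulated, the region labels `R1` to `R6` and the variable names are hypothetical, and Ward linkage is an assumption; `cluster_continents.R` may use another method.

```r
set.seed(42)

# Simulated toy data: 6 hypothetical regions with 10 units each (not real values)
regions <- paste0("R", 1:6)
toy <- data.frame(
  region = rep(regions, each = 10),
  var_a  = rnorm(60, mean = rep(c(60, 72, 80, 78, 70, 75), each = 10), sd = 3),
  var_b  = rnorm(60, mean = rep(c(3, 12, 35, 40, 10, 14), each = 10), sd = 2)
)

# Mean profile per region
profiles <- aggregate(cbind(var_a, var_b) ~ region, data = toy, FUN = mean)
rownames(profiles) <- profiles$region

# Standardize the profiles and cluster the regions
prof_scaled <- scale(profiles[, -1])
hc <- hclust(dist(prof_scaled), method = "ward.D2")

groups <- cutree(hc, k = 2)             # cut the tree into two clusters of regions
groups
# plot(hc) draws the dendrogram; ape::as.phylo(hc) converts it for ape's plotting styles
```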
# PART 4: K-means & Model Based Clustering between Countries
library(mclust)
source('clusters_countries.R')
Cl_countries <- clusters_countries(cleaned)
Comparison using a chi-square test of the dependency between groups and continents: the model-based groups are more similar to the continents.
Model-based group 7 is difficult to name. The result for this group is:
     pop_total murder_pp armed_pp phones_p100 children_p_woman life_exp_yrs suicide_pp urban_pop_tot sex_ratio_p100
[1,] 239114394         0    0.011     146.317             2.07       78.053          0     118882678        148.681
     corruption_CPI internet_%of_pop child_mort_p1000 income_per_person investments_per_ofGDP   gini
[1,]         53.677           75.725           10.791          50579.31                29.942 39.722
Developed countries are split into 3 groups.
Poor countries are the same in both models.
The “crowded” group from k-means is lost. It transforms into group 7, which is characterized by high income, sex ratio, population and number of phones.
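The comparison between a clustering and a known grouping can be sketched as follows. The data are simulated, the region labels are hypothetical, and Cramér's V is added here as one possible way to put two partitions on a common scale; it is not stated that the analysis used it.

```r
set.seed(7)

# Simulated data: 3 hypothetical regions with 30 units each (not real values)
region <- factor(rep(c("R1", "R2", "R3"), each = 30))
X <- cbind(
  x1 = rnorm(90, mean = rep(c(0, 3, 6), each = 30)),
  x2 = rnorm(90, mean = rep(c(0, 3, 0), each = 30))
)

# k-means on standardized variables, several random starts for stability
km <- kmeans(scale(X), centers = 3, nstart = 25)

# Cross-tabulate clusters against regions and test for dependency
tab <- table(cluster = km$cluster, region = region)
chi <- chisq.test(tab)

# Cramer's V: chi-square statistic rescaled to the interval [0, 1]
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

v_kmeans <- cramers_v(tab)
tab
chi$p.value
v_kmeans
```

The same table and statistic can be computed for model-based labels, for example the `classification` component returned by `mclust::Mclust()`, and the partition with the larger value is the one that follows the grouping more closely. With many sparse cells, `chisq.test(tab, simulate.p.value = TRUE)` avoids relying on the asymptotic approximation.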
To find the number of factors, EFA was performed starting with 1 factor and increasing the number of factors until the RMSE fell below 0.05. This gave four as the optimal number of factors. The loadings of the EFA with 4 factors are:
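The stopping rule described above can be sketched with `factanal()`. The text does not define how the RMSE is computed, so this sketch assumes the root mean square of the off-diagonal residual correlations; the data are simulated from a known two-factor model, so the loop here stops at two factors, not four.

```r
set.seed(123)

# Simulated data from a two-factor model (illustrative only)
n <- 500
f <- matrix(rnorm(n * 2), n, 2)
L_true <- rbind(c(.8, 0), c(.8, 0), c(.8, 0), c(0, .8), c(0, .8), c(0, .8))
X <- f %*% t(L_true) + matrix(rnorm(n * 6, sd = 0.6), n, 6)
colnames(X) <- paste0("v", 1:6)
R <- cor(X)

# Root mean square of the off-diagonal residuals (assumed definition of the error)
rmsr <- function(fit, R) {
  L <- unclass(fit$loadings)
  R_hat <- L %*% t(L) + diag(fit$uniquenesses)   # correlation matrix implied by the model
  res <- R - R_hat
  sqrt(mean(res[lower.tri(res)]^2))
}

threshold <- 0.05
chosen <- NA
for (k in 1:3) {                        # 3 is the largest model 6 variables can support
  fit <- factanal(X, factors = k)
  err <- rmsr(fit, R)
  cat(sprintf("factors = %d, RMSR = %.3f\n", k, err))
  if (err < threshold) {
    chosen <- k
    break
  }
}
chosen
```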
# PART 5: EFA
source('EFA.R')
EFA_loadings(cleaned)
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## pop_total 0.995
## murder_pp 0.825
## armed_pp
## phones_p100 0.615
## children_p_woman -0.918
## life_exp_yrs 0.875
## suicide_pp
## urban_pop_tot 0.958
## sex_ratio_p100 0.538
## corruption_CPI 0.538
## internet_%of_pop 0.847
## child_mort_p1000 -0.940
## income_per_person 0.575 0.768
## investments_per_ofGDP
## gini 0.625
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 4.439 1.949 1.262 1.153
## Proportion Var 0.296 0.130 0.084 0.077
## Cumulative Var 0.296 0.426 0.510 0.587
From the loadings, the factors can be interpreted as follows:
1. Factor 1 has high life expectancy and internet access, a moderate loading on income per person, and low child mortality and children per woman. It therefore represents the level of development of the country.
2. Factor 2 represents the level of population.
3. Factor 3 represents inequality and murder.
4. Factor 4 represents the level of income in relation to the number of men and women in the country.
To visualize these four factors, the 10 countries with the highest score on each factor were taken, creating four groups of countries in which the respective factor is most relevant. The groups are named according to the meaning of each factor:
Factor 1 -> Developed
Factor 2 -> Crowded
Factor 3 -> Inequality
Factor 4 -> Gender/Income
These can be visualized in the following graph:
source('EFA.R')
groups = EFA_plot(cleaned)
Note: Some countries in the groups, such as Singapore or Qatar, are too small to be visible on the map.
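Building the groups from factor scores can be sketched as follows. This is an assumption about what `EFA_plot()` does internally; the data are simulated, the unit names are hypothetical, and regression scores are only one of the score types `factanal()` offers.

```r
set.seed(123)

# Simulated data from a two-factor model (illustrative only)
n <- 200
f <- matrix(rnorm(n * 2), n, 2)
L_true <- rbind(c(.8, 0), c(.8, 0), c(.8, 0), c(0, .8), c(0, .8), c(0, .8))
X <- f %*% t(L_true) + matrix(rnorm(n * 6, sd = 0.6), n, 6)
colnames(X) <- paste0("v", 1:6)
unit_names <- paste0("unit_", seq_len(n))        # hypothetical stand-in for country names

# Fit the factor model and request a score per unit and factor
fit <- factanal(X, factors = 2, scores = "regression")
scores <- fit$scores
rownames(scores) <- unit_names

# For each factor, keep the names of the 10 units with the highest score
top10 <- lapply(seq_len(ncol(scores)), function(j) {
  rownames(scores)[order(scores[, j], decreasing = TRUE)[1:10]]
})
names(top10) <- colnames(scores)
top10
```

Each factor is ranked independently, so a unit can appear in more than one group. This matches the lists in this report, where Japan appears in groups 1 and 2, Brazil in groups 2 and 3, and Singapore in groups 1 and 4.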
The countries in group 1 are:
library(knitr)
print(groups[1])
[[1]]
 [1] "Spain"         "Estonia"       "Finland"       "Switzerland"
 [5] "Andorra"       "Austria"       "Singapore"     "Liechtenstein"
 [9] "South Korea"   "Japan"
The countries in group 2 are:
library(knitr)
print(groups[2])
[[1]]
 [1] "Japan"         "Russia"        "Bangladesh"    "Pakistan"
 [5] "Nigeria"       "Brazil"        "Indonesia"     "United States"
 [9] "India"         "China"
The countries in group 3 are:
library(knitr)
print(groups[3])
[[1]]
 [1] "Brunei Darussalam" "Swaziland"         "Honduras"
 [4] "Brazil"            "Guatemala"         "Colombia"
 [7] "Venezuela"         "Lesotho"           "South Africa"
[10] "El Salvador"
The countries in group 4 are:
library(knitr)
print(groups[4])
[[1]]
 [1] "Saudi Arabia"         "Monaco"               "Norway"
 [4] "Ireland"              "Kuwait"               "Brunei"
 [7] "Singapore"            "United Arab Emirates" "Luxembourg"
[10] "Qatar"
# PART 6: CFA
#???????